Look, Listen and Learn - A Multimodal LSTM for Speaker Identification
Authors
Abstract
Speaker identification refers to the task of localizing the face of the person whose identity matches the ongoing voice in a video. This task requires not only collective perception over both visual and auditory signals; robustness to severe quality degradations and unconstrained content variations is also indispensable. In this paper, we describe a novel multimodal Long Short-Term Memory (LSTM) architecture which seamlessly unifies both visual and auditory modalities from the beginning of each sequence input. The key idea is to extend the conventional LSTM by sharing weights not only across time steps but also across modalities. We show that modeling the temporal dependency across face and voice can significantly improve robustness to content quality degradations and variations. We also found that our multimodal LSTM is robust to distractors, namely the non-speaking identities. We applied our multimodal LSTM to The Big Bang Theory dataset and showed that our system outperforms state-of-the-art systems in speaker identification with a lower false alarm rate and higher recognition accuracy.
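To make the weight-sharing idea concrete, here is a minimal NumPy sketch, not the authors' implementation: a single set of LSTM gate parameters is applied to both the face stream and the voice stream. The dimensions, the common 64-d input projection, and the per-modality hidden states are illustrative assumptions, and the paper's cross-modal coupling of the two streams is omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SharedLSTMCell:
    """An LSTM cell whose gate weights are reused for every modality
    (and, as in any LSTM, for every time step)."""

    def __init__(self, input_dim, hidden_dim, rng):
        scale = 1.0 / np.sqrt(input_dim + hidden_dim)
        # One weight matrix per gate, shared across time AND modalities.
        self.W = {g: rng.normal(0, scale, (hidden_dim, input_dim + hidden_dim))
                  for g in ("i", "f", "o", "g")}
        self.b = {g: np.zeros(hidden_dim) for g in ("i", "f", "o", "g")}

    def step(self, x, h, c):
        z = np.concatenate([x, h])
        i = sigmoid(self.W["i"] @ z + self.b["i"])   # input gate
        f = sigmoid(self.W["f"] @ z + self.b["f"])   # forget gate
        o = sigmoid(self.W["o"] @ z + self.b["o"])   # output gate
        g = np.tanh(self.W["g"] @ z + self.b["g"])   # candidate cell state
        c = f * c + i * g
        h = o * np.tanh(c)
        return h, c

def run_multimodal(cell, face_seq, voice_seq, hidden_dim):
    """Run the SAME cell over the face and voice streams. Each modality
    keeps its own hidden/cell state, but the parameters are identical:
    the cross-modality weight sharing described in the abstract."""
    states = {}
    for name, seq in (("face", face_seq), ("voice", voice_seq)):
        h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
        for x in seq:
            h, c = cell.step(x, h, c)
        states[name] = h
    return states

rng = np.random.default_rng(0)
cell = SharedLSTMCell(input_dim=64, hidden_dim=32, rng=rng)
# Toy inputs: both modalities assumed projected to a common 64-d space.
face = rng.normal(size=(10, 64))   # 10 frames of face features
voice = rng.normal(size=(10, 64))  # 10 frames of voice features
out = run_multimodal(cell, face, voice, hidden_dim=32)
print(out["face"].shape, out["voice"].shape)  # (32,) (32,)
```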
Similar resources
A First Look at Music Composition using LSTM Recurrent Neural Networks
In general, music composed by recurrent neural networks (RNNs) suffers from a lack of global structure. Though networks can learn note-by-note transition probabilities and even reproduce phrases, attempts at learning an entire musical form and using that knowledge to guide composition have been unsuccessful. The reason for this failure seems to be that RNNs cannot keep track of temporally distan...
Learning the Long-Term Structure of the Blues
In general, music composed by recurrent neural networks (RNNs) suffers from a lack of global structure. Though networks can learn note-by-note transition probabilities and even reproduce phrases, they have been unable to learn an entire musical form and use that knowledge to guide composition. In this study, we describe model details and present experimental results showing that LSTM successfull...
Proposing Multimodal Integration Model Using LSTM and Autoencoder
We propose an architecture of neural network that can learn and integrate sequential multimodal information using Long Short-Term Memory. Our model consists of encoder and decoder LSTMs and a multimodal autoencoder. For integrating sequential multimodal information, firstly, the encoder LSTM encodes a sequential input to a fixed range feature vector for each modality. Secondly, the multimodal aut...
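A minimal sketch of the pipeline this snippet describes, under loose assumptions: a simple tanh RNN stands in for the encoder LSTM, each modality's sequence is collapsed to its final hidden state, and a one-layer autoencoder fuses the concatenated codes. All names, shapes, and the decoder LSTM's omission are illustrative choices, not the paper's configuration.

```python
import numpy as np

def encode_sequence(seq, W, U, b):
    """Toy RNN encoder standing in for the encoder LSTM: collapses a
    variable-length sequence into one fixed-size vector (final state).
    Shapes: W (h, d), U (h, h), b (h,)."""
    h = np.zeros(W.shape[0])
    for x in seq:
        h = np.tanh(W @ x + U @ h + b)
    return h

def autoencoder_fuse(codes, W_enc, W_dec):
    """Multimodal autoencoder: concatenate the per-modality codes,
    squeeze them through a shared bottleneck, and reconstruct."""
    joint = np.concatenate(codes)
    z = np.tanh(W_enc @ joint)   # fused multimodal representation
    recon = W_dec @ z            # reconstruction target is `joint`
    return z, recon

rng = np.random.default_rng(0)
d, h, bottleneck = 16, 8, 6
W, U, b = rng.normal(size=(h, d)), rng.normal(size=(h, h)), np.zeros(h)

# Two modalities with different sequence lengths.
audio = rng.normal(size=(12, d))
vision = rng.normal(size=(7, d))
codes = [encode_sequence(audio, W, U, b), encode_sequence(vision, W, U, b)]

W_enc = rng.normal(size=(bottleneck, 2 * h))
W_dec = rng.normal(size=(2 * h, bottleneck))
z, recon = autoencoder_fuse(codes, W_enc, W_dec)
print(z.shape, recon.shape)  # (6,) (16,)
```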
Audio-Visual Correlation Modeling for Speaker Identification and Synthesis
This thesis addresses two major problems of multimodal signal processing using audiovisual correlation modeling: speaker recognition and speaker synthesis. We address the first problem, i.e., the audiovisual speaker recognition problem within an open-set identification framework, where audio (speech) and lip texture (intensity) modalities are fused employing a combination of early and late inte...
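The early/late integration mentioned in this snippet contrasts two standard fusion points; the sketch below illustrates both with linear scorers as placeholders. The actual thesis uses its own classifiers and combination rules; the weighting scheme here is an assumption for illustration.

```python
import numpy as np

def early_fusion_score(speech, lip, w_joint):
    """Early fusion: concatenate the modality features first, then
    score the joint vector with a single classifier (linear stand-in)."""
    joint = np.concatenate([speech, lip])
    return float(w_joint @ joint)

def late_fusion_score(speech, lip, w_speech, w_lip, alpha=0.5):
    """Late fusion: score each modality separately, then combine the
    decisions, e.g. with a weighted sum of the two scores."""
    s_audio = float(w_speech @ speech)
    s_visual = float(w_lip @ lip)
    return alpha * s_audio + (1.0 - alpha) * s_visual

rng = np.random.default_rng(0)
speech, lip = rng.normal(size=20), rng.normal(size=20)
print(early_fusion_score(speech, lip, rng.normal(size=40)))
print(late_fusion_score(speech, lip, rng.normal(size=20), rng.normal(size=20)))
```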
Incremental Multimodal Feedback for Conversational Agents
Just like humans, conversational computer systems should not listen silently to their input and then respond. Instead, they should enforce the speaker-listener link by attending actively and giving feedback on an utterance while perceiving it. Most existing systems produce direct feedback responses to decisive (e.g. prosodic) cues. We present a framework that conceives of feedback as a more com...